feat(bench): live observe→steer join (real worker + real observer) by drewstone · Pull Request #195 · tangle-network/agent-runtime

drewstone · 2026-06-08T14:24:37Z

What

Adds bench/src/cloud-loop.mts — the observe→steer loop closed on live endpoints:

round → REAL cloud worker (openSandboxRun, opencode in a box) over task + accumulated steers
      → its real event trace
      → observe() with a REAL router LLM reads the trace → an AnalystFinding
      → finding.recommended_action injected as the next round's steer
      → stop on the deterministic verifier, or budget
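
The round structure above can be sketched roughly as follows. This is a minimal illustration, not the contents of cloud-loop.mts: the `LoopDeps` callbacks (`runWorker`, `observeTrace`, `verify`) and the `TraceEvent`/`LoopResult` shapes are hypothetical stand-ins, and the real signatures of openSandboxRun and observe() may differ.

```typescript
// Hypothetical shapes for illustration only; not the real agent-runtime types.
interface TraceEvent { type: string; text?: string }
interface AnalystFinding { recommended_action?: string }

interface LoopDeps {
  // Stand-in for the cloud worker leg (openSandboxRun in the PR).
  runWorker: (task: string, steers: readonly string[]) => Promise<TraceEvent[]>;
  // Stand-in for the observer leg (observe() with a router LLM in the PR).
  observeTrace: (trace: readonly TraceEvent[]) => Promise<AnalystFinding[]>;
  // Deterministic verifier over the worker's trace.
  verify: (trace: readonly TraceEvent[]) => boolean;
}

interface LoopResult { passed: boolean; rounds: number; steers: string[] }

async function observeSteerLoop(
  task: string,
  maxRounds: number,
  deps: LoopDeps,
): Promise<LoopResult> {
  // Assumption: steers accumulate across rounds, following the PR description.
  const steers: string[] = [];
  for (let round = 1; round <= maxRounds; round++) {
    const trace = await deps.runWorker(task, steers);
    // Stop as soon as the deterministic verifier passes.
    if (deps.verify(trace)) return { passed: true, rounds: round, steers };
    // Otherwise turn each finding's recommended_action into a steer.
    const findings = await deps.observeTrace(trace);
    for (const finding of findings) {
      if (finding.recommended_action) steers.push(finding.recommended_action);
    }
  }
  // Budget exhausted without a verifier pass.
  return { passed: false, rounds: maxRounds, steers };
}
```

Keeping both legs behind injected callbacks is only a convenience of the sketch: it lets the same loop run against live endpoints or canned ones.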

Why

PR #194 shipped observe-steer-workspace-loop.mts, which proves the join through a mock observer (transport:'mock', canned findings) feeding a canned worker — "the grammar talking to itself," the exact pattern docs/research/loop-facade-postmortem.md (also in #194) warns against. This proves the same join with both ends real.

The two are complementary surfaces, not duplicates:

observe-steer-workspace-loop.mts — exercises the Scope/Supervisor/coordination-MCP/git-workspace plumbing (mock ends, deterministic, no creds).
cloud-loop.mts (this PR) — exercises the live worker + live observer path (openSandboxRun + observe()).

Status (honest)

The join ran live end-to-end for 3 rounds (real worker → real trace → real router-LLM finding → real steer injection). Re-runs are currently blocked at provisioning by a sandbox egress regression: router.tangle.tools returns CONNECT-403 from inside the box (only that host; id/pangolin/sandbox.tangle.tools and api.openai.com all pass). It worked on 2026-06-06, so this is a platform regression, tracked as ops-board #984. This PR therefore proves the live join; efficacy (does the steer improve behavior at equal budget?) is gated on that unblock.

Follow-up (recommendation, not in this PR)

Once #984 is unblocked, run for efficacy. Separately, consider converting observe-steer-workspace-loop.mts into a real CI unit test under tests/loops/ (it currently runs as a standalone tsx demo and asserts nothing), or retiring it now that the live join exists.

Test

Code is byte-identical to the version that ran live for 3 rounds (only the header docstring changed). bench/** is outside the root biome scope (consistent with sibling fleet.mts/workspace-loop.mts); build is verified-by-execution.


tangletools · 2026-06-08T14:30:06Z

✅ No Blockers — `302af97e`

Readiness 76/100 · Confidence 65/100 · 6 findings (1 medium, 5 low)

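| Metric | deepseek | glm | aggregate |
| --- | --- | --- | --- |
| Readiness | 76 | 89 | 76 |
| Confidence | 65 | 65 | 65 |
| Correctness | 76 | 89 | 76 |
| Security | 76 | 89 | 76 |
| Testing | 76 | 89 | 76 |
| Architecture | 76 | 89 | 76 |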

Full multi-shot audit completed 1/1 planned shots over 1 changed file. Global verifier still owns final merge decision.

🟠 MEDIUM observer call has no timeout/abort signal — bench/src/cloud-loop.mts

L117-120: await observe(...) is called without an AbortSignal. The per-round 240s timeout on L88 (setTimeout(() => controller.abort(), 240_000)) only aborts openSandboxRun, and the timer is cleared on L109 before observe runs. If the router LLM inside observe hangs (network stall, model degradation), the script hangs indefinitely with no timeout. The controller is still in scope — pass signal: controller.signal into the observe options to give it a hard deadline. The observe() function forwards opts.signal to chat.chat() (observe.ts:150), so the plumbing already exists.

🟡 LOW AbortSignal not propagated to observe() call — bench/src/cloud-loop.mts

Line 111: observe(...) accepts opts.signal (ObserveOptions.signal exists per src/runtime/observe.ts:46) but the cloud-loop does not pass controller.signal. If the observe LLM call hangs, the round cannot be cancelled. The overall loop is bounded by ROUNDS, so this is not a hang risk, but it means a timed-out round's observer call continues burning tokens after the worker was already aborted. Pass { chat, model, signal: controller.signal } to observe.
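
One way to give the observer leg its own hard deadline is sketched below. `withDeadline` and `slowObserver` are hypothetical names introduced for illustration; the stub only mimics an observer that honours `opts.signal`, which the findings say observe() does. Since the findings also say the worker's timer is cleared before observe runs, reusing the existing controller would presumably need that timer kept armed (or a fresh one started) for the signal to ever fire; the sketch sidesteps this by creating a new controller per leg.

```typescript
// Runs one async leg under its own deadline and always clears the timer.
async function withDeadline<T>(
  ms: number,
  run: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await run(controller.signal);
  } finally {
    clearTimeout(timer); // never leave a stray timer keeping the process alive
  }
}

// Hypothetical stand-in for an observer call that honours opts.signal.
function slowObserver(
  opts: { signal?: AbortSignal },
  delayMs: number,
): Promise<string> {
  return new Promise((resolve, reject) => {
    if (opts.signal?.aborted) {
      reject(new Error("observer aborted"));
      return;
    }
    const t = setTimeout(() => resolve("finding"), delayMs);
    opts.signal?.addEventListener(
      "abort",
      () => {
        clearTimeout(t);
        reject(new Error("observer aborted"));
      },
      { once: true },
    );
  });
}
```

In the loop, the call site would then look something like `await withDeadline(60_000, (signal) => observe(trace, { chat, model, signal }))`, where the argument shape of observe() and the 60-second budget are both assumptions.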

🟡 LOW Final status message checks current steers, not cumulative history — bench/src/cloud-loop.mts

Line 125: steers.length ? 'steered ' : '' reflects only whether the LAST round had steers (steers is cleared and refilled each round at line 117-118). If the observer returned findings in round 2 but not round 3, the final message says 'rounds' without 'steered', which is misleading. Cosmetic only. Track a boolean everSteered if accurate reporting matters.
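
A small sketch of the suggested `everSteered` flag, assuming steers are cleared and refilled each round as this finding describes. The `summarize` helper and its message format are hypothetical; only the `'steered '` prefix logic comes from the finding.

```typescript
// steersPerRound[i] holds the steers produced for round i (hypothetical input shape).
function summarize(steersPerRound: readonly (readonly string[])[]): string {
  let everSteered = false;
  let steers: string[] = [];
  for (const next of steersPerRound) {
    steers = [...next]; // cleared and refilled each round
    if (steers.length > 0) everSteered = true; // remember any round that steered
  }
  // Report on cumulative history rather than on the last round's steers.
  return `${steersPerRound.length} ${everSteered ? "steered " : ""}rounds`;
}
```

With a sticky flag, a run that steered in round 2 but not round 3 is still reported as steered.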

🟡 LOW no test coverage — bench/src/cloud-loop.mts

No tests exist for this file. The vitest config (vitest.config.ts:5) excludes bench/** entirely, so even if tests were written they wouldn't run in CI. This is a bench/tooling script by design, but the verify() and tools() functions are pure and testable. Consider extracting them to a testable location or adding an integration check gated on env vars.

🟡 LOW observer failure unhandled — crashes the loop — bench/src/cloud-loop.mts

L117-120: observe() is called outside the try/catch that protects the sandbox run (L91-108). If the router LLM returns a malformed JSON response, or the network errors, observe() throws → bypasses the catch on L105 → propagates to main().catch() on L135 → logs and exits 1. The per-round error handling pattern (log, continue) is broken for the observer leg. Wrap in try/catch and continue on failure so a transient router blip doesn't kill the whole bench run.
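
The log-and-continue pattern this finding asks for could look like the sketch below. `observeOrSkip`, its parameters, and the `Finding` shape are hypothetical; the real observe() call would be passed in as `observeFn`.

```typescript
interface Finding { recommended_action?: string }

// Wraps the observer leg so a thrown error yields "no findings" instead of
// propagating to main().catch() and ending the whole bench run.
async function observeOrSkip(
  round: number,
  observeFn: () => Promise<Finding[]>,
  log: (msg: string) => void = console.error,
): Promise<Finding[]> {
  try {
    return await observeFn();
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    log(`round ${round}: observer failed (${reason}); continuing without a steer`);
    return [];
  }
}
```

Returning an empty list keeps the caller's steer-injection code unchanged: a failed observer round simply contributes no steer.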

🟡 LOW unnecessary as never type assertion — bench/src/cloud-loop.mts

L96: fromEvents: (e) => answerOutput.parse(e as never). The e parameter is typed SandboxEvent[] from Deliverable<'events'>. answerOutput.parse accepts ReadonlyArray<unknown> per OutputAdapter<string> (experiment.ts:45). SandboxEvent[] is assignable to ReadonlyArray<unknown> without any cast. The as never is dead code — remove it. If the cast was suppressing a real type error, the root cause should be fixed instead of papered over.
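
The assignability claim can be checked in isolation. In this sketch `SandboxEvent` and `OutputAdapter` are simplified, hypothetical stand-ins for the real types, and the `parse` body is invented purely so the example runs; the point is only that a `SandboxEvent[]` argument type-checks against a `ReadonlyArray<unknown>` parameter without a cast.

```typescript
// Simplified stand-ins; the real types live in the agent-runtime sources.
interface SandboxEvent { type: string; text?: string }
interface OutputAdapter<T> { parse(events: ReadonlyArray<unknown>): T }

// Hypothetical adapter: concatenates the text of events that carry one.
const answerOutput: OutputAdapter<string> = {
  parse: (events) =>
    events
      .map((e) => {
        if (typeof e === "object" && e !== null && "text" in e) {
          return String((e as { text: unknown }).text);
        }
        return "";
      })
      .join(""),
};

// No `as never` needed: SandboxEvent[] is assignable to ReadonlyArray<unknown>.
const fromEvents = (e: SandboxEvent[]): string => answerOutput.parse(e);
```

If removing the cast in the real file does surface a compile error, that would suggest the actual types differ from what the finding assumes.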


tangletools

✅ Approved — 6 non-blocking findings — `302af97e`

Full multi-shot audit completed 1/1 planned shots over 1 changed file. Global verifier still owns final merge decision.


tangletools approved these changes Jun 8, 2026


drewstone merged commit 4917ef6 into main Jun 8, 2026
1 check passed

drewstone deleted the feat/loops-live-join branch June 8, 2026 15:26
